Abstract: We motivate the design and implementation of a
platform-neutral compute intermediate language (Pencil) for 
productive and performance-portable accelerator programming. 

I. Introduction 

Many systems - from supercomputer installations to embed- 
ded systems-on-chip - benefit from using special-purpose ac- 
celerators which can significantly outperform general-purpose 
processors in terms of energy efficiency as well as in terms of 
execution speed. 

Software for accelerated systems, however, is currently 
written using low-level APIs, such as OpenCL and CUDA, 
which increases the cost of its development and maintenance. 
On the other hand, general-purpose programming languages 
like C, C++ and Java do not directly leverage features of 
accelerators, such as data-level parallelism, or support com- 
mon accelerator programming idioms, such as iteration space 
tiling. Furthermore, in many application domains for which 
accelerators show promise, such as image processing and 
computational fluid dynamics, it is common to program in 
domain-specific languages (DSLs). 

Compiling DSLs directly into OpenCL or CUDA is possible 
but not advisable. For example, to target accelerated platforms 
effectively the DSL implementers must develop sophisticated 
code generation and optimization techniques. Given typical 
budget constraints, they will likely limit their efforts to a set 
of techniques useful for a small number of target platforms 
(e.g. accelerated by NVIDIA GPUs), thus compromising on
performance portability. Moreover, the implementers of dif- 
ferent DSLs will likely spend their efforts on implementing 
an overlapping set of techniques. Clearly, both teams would 
benefit if they could target an efficiently implemented inter- 
mediate language. 

Besides enhancing productivity, DSLs have the advantage of
using high-level constructs that have rich semantics. These
constructs provide a wealth of information that enables the
compiler to optimize and parallelize the code even for algo- 
rithms that are considered to be irregular when expressed in 
languages like C. DSL compilers keep close control over the
generated code, eliminating many of the problems faced by
general-purpose optimizing compilers. 

In this article, we present our work in progress on a 
platform-neutral compute intermediate language for DSLs 
called Pencil. We give an early overview of Pencil and 
some of the design guidelines that will help in its definition.
We show some coding rules, language extensions and
directives that we envisage including in PENCIL, along with a
preliminary syntax. Finally, we present two examples of
DSLs and show how they can be expressed in Pencil. 

II. Overview of Pencil

Pencil will be a platform-neutral intermediate language for 
multiple high performance DSLs. An optimization framework 
will take care of optimizing and parallelizing the intermediate 
language. In this paper we use the polyhedral framework [?] as 
an example of a static optimization framework. The polyhedral 
framework uses an algebraic representation and abstraction of 
programs to reason about loop transformations, allowing the 
modeling and application of complex loop nest transformations
addressing most of the parallelism and locality-enhancing
challenges.

The Pencil language is meant to facilitate automatic parallelization
and optimization for execution on multi-threaded
SIMD hardware; it will thus have sequential semantics. The
syntax presented in this work is a preliminary syntax based
on C, benefiting from C99 and the GNU extensions.

Pencil will be suitably high-level to allow straightforward 
DSL-to-PENCIL compilation, but will provide direct support
for common accelerator features and programming idioms, to 
allow downstream compilation into extremely efficient low- 
level code. In particular, its features will include extensions 
and directives (pragmas) allowing users to supply information 
about dependences and memory access patterns that may be 
difficult or impossible to analyze automatically, and a low- 
level API allowing expert programmers to exert control over 
performance-related aspects such as scheduling, vectorization, 
placement and data layout, when desired. 

The information captured by PENCIL extensions and directives
(pragmas) is similar to Æcute metadata [?], which
have proved successful in proof-of-concept implementations.
We plan to extend this initial work in two ways. 
First, we will ensure that PENCIL can represent both regular
and irregular algorithms suitable for accelerators, by systemati- 
cally studying algorithmic 'motifs' (originally called 'dwarfs') 
proposed by researchers from Berkeley [?]. Æcute metadata
are a close fit for regular algorithms which typically have static 
iteration spaces and memory access patterns, such as dense 
linear algebra and stencil computations. We have used similar 
techniques to generate efficient OpenCL code for an irregular 
algorithm - sparse matrix-vector product - for several state- 
of-the-art sparse matrix formats suited for GPUs [?]. 

Second, we will investigate the use of directives and ex- 
tensions in cross-component optimizations, where dependence 
information associated with several computational kernels is 
collectively exploited to perform transformations to increase 
parallelism and locality. In addition to being useful as a 
compilation target, PENCIL will remain sufficiently high- 
level and structured to be used directly as an efficiency lan- 
guage, particularly for library implementers. Thus, the cross- 
component optimizer will be designed to support linking and 
transformation of a mixture of PENCIL code compiled from 
DSLs, hand-written user and library code. 

Figure 1 shows the DSL compilation flow involving PENCIL.
First, a program written in a domain-specific language
is translated into PENCIL. The PENCIL design aims to make
the task of writing a DSL-to-PENCIL compiler (the job of the
DSL implementer) as straightforward as possible. Domain spe- 
cific optimizations are applied during this translation. Second, 
the generated Pencil code is combined with hand-written 
Pencil codes that implement specific library functions. This 
combination of codes is then optimized and parallelized (using 
the polyhedral framework, for example). Finally, highly specialized
OpenCL code is generated. The generated code is tuned
through profile-based iterative compilation and auto-tuning. 
A. Pencil design

In order to guarantee the correctness of optimizations,
compilers usually make conservative assumptions. These conservative
assumptions reduce the ability of the compiler to find
optimizations. The compiler may assume, for example, that
two pointers may alias, whereas the pointers do not actually
alias. The fact that the two pointers do not alias is in general
well known to the programmer and to the DSL compiler, but
this information is not, in general, transmitted to the compiler.

To address this problem, PENCIL sets coding rules that may 
be used by DSL compilers and by expert PENCIL programmers 
in order to enhance the ability of the compiler to perform 
static code analysis. Some of these rules will be checked and 
enforced by the PENCIL compiler, while some others are left 
up to the programmer or DSL compiler. 

The current syntax of PENCIL uses C annotations and 
extensions where possible. As such, a PENCIL program, in 
the current state, retains the standard syntax and semantics 
of a C program and can be processed by an ordinary C 
compiler. Semantic additions to C make use of custom GNU
extensions and directives. 

While designing PENCIL, we are putting a strong emphasis
on the definition of annotation syntaxes and coding rules that
may be easily lowered to compiler intermediate representations
using attributes and built-in functions, mainly because we are
considering an equivalent LLVM IR syntax for PENCIL.

Fig. 1: The DSL compilation flow involving PENCIL.

B. Examples of PENCIL coding rules, extensions and direc- 
tives 

1) Coding rules: One of the main characteristics of PENCIL
is its restriction on pointer usage in order to eliminate aliasing. 
Pencil will accept only non-array variables to be passed by 
pointer. Array parameters must be passed using the C99 VLA 
syntax and must be qualified restrict, const, and static 
with the same syntax and semantics as in C99. For example: 

/* The following function is PENCIL-compliant. */
void foo(int a[restrict const static 5]) {
    /* 'a[]' is const but its elements are not. */
    a[0] = 1;
    /* Here const is not required. Local variables are
       not coerced to pointers. */
    int c[2];
}

/* Example of non PENCIL-compliant declarations. */
void bar(int *(d[4])) {
    int *e;
}

Readers may recall that C99 coerces the type of a in func- 
tion foo to int*, but we require the explicit array declaration 
syntax to reinforce that PENCIL disallows pointer arithmetic. 
We may also find ways to leverage the declared array size 
information in PENCIL compilers in the future. 

A pass-by-pointer parameter should be declared in the
receiving function's prototype as a const restrict pointer.
These restrictions guarantee that a pointer can only point to a 
fixed memory region throughout its lifetime and that different 
pointers never point to the same memory region. 
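
As a minimal sketch of how the two parameter rules combine in ordinary C99, the function below takes an array through the VLA syntax and a scalar result through a const restrict pointer. The function name scale_and_count and its behavior are hypothetical and chosen only for illustration; they are not part of PENCIL.

```c
/* Hypothetical example; names are illustrative only.
   'count' is a non-array variable passed by pointer, so it is declared
   as a const restrict pointer: it cannot be re-seated and is assumed
   not to alias 'a'. The array uses the C99 VLA parameter syntax. */
void scale_and_count(int n, int a[restrict const static n],
                     int *const restrict count)
{
    *count = 0;
    for (int i = 0; i < n; i++) {
        a[i] *= 2;          /* elements of 'a' remain writable */
        if (a[i] > 10)
            *count += 1;    /* number of doubled elements above 10 */
    }
}
```

A caller passes an array of at least n elements and the address of an int that does not overlap the array; violating either assumption breaks the guarantees the qualifiers are meant to convey.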

Other coding rules that we envisage enforcing in PENCIL
programs include the constraint that recursion (whether direct 
or indirect) and unstructured control flow (via gotos) are not 
allowed. 

2) Extensions: PENCIL provides access summary functions
for describing the data access patterns of a function. This
mechanism may be applied to any function, including those
whose behaviors are too complex for the compiler to infer
accurately, as well as library functions whose source code is
not available to the PENCIL compiler and/or which internally
use features of C that are banned in PENCIL. In the following
example, ACCESS declares that foo performs the same data accesses
as foo_summary (array qualifiers are omitted for brevity):

void foo_summary(int n, int A[n], int B[n],
                 int C[n])
{
    for (int i=0; i<n; i++) {
        DEF(A[i]); USE(B[i]); MAY_DEF(B[i]);
    }
    if (n < 4) DEF(C[0]); // one-element def
    USE(A[n-1]);
}

void foo(int n, int A[n], int B[n], int C[n])
    ACCESS(foo_summary(n, A, B, C))
{
    int i;
    for (i=0; i<n; i++) {
        A[i] = B[i];
        B[rand() % n] = 42;
    }
    if (n < 4) C[0] = A[n-1];
}

The macros DEF, USE, and MAY_DEF expand to built-in
functions that modify, use, or may modify their argument, 
respectively, but which are guaranteed not to be accidentally 
optimized out in upstream compiler passes. The actual ac- 
cesses summarized by the function are defined by the array 
elements traversed along the execution of the summary func- 
tion. Control flow and C instructions are only meant to drive 
the enumeration of these accesses. Since these summaries are 
meant to be processed by a static analyzer, non-affine control 
flow may lead to further discrepancies between may-write and 
must-write access sets. For example, the result of such a static 
analysis could take the form of three distinct access relations, 
mapping each iteration of the summarized function call and/or 
its parameters to a set of may-write, must-write, and read
accesses, respectively.
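
To make the idea that a summary only enumerates accesses more concrete, the sketch below runs foo_summary as plain C with stand-in macros. The counting definitions of DEF, USE, and MAY_DEF and the counters n_def, n_use, and n_may_def are assumptions made for this illustration only; in PENCIL the macros expand to compiler built-ins and the summary is analyzed statically rather than executed.

```c
/* Hypothetical stand-ins for the PENCIL built-ins: each macro only
   counts the accesses it is asked to record, so the summary can be run
   as plain C. A real PENCIL compiler would analyze them statically. */
static int n_def, n_use, n_may_def;
#define DEF(x)     ((void)&(x), n_def++)
#define USE(x)     ((void)&(x), n_use++)
#define MAY_DEF(x) ((void)&(x), n_may_def++)

/* Same summary as in the text: control flow only drives the
   enumeration of the accessed array elements. */
void foo_summary(int n, int A[n], int B[n], int C[n])
{
    for (int i = 0; i < n; i++) {
        DEF(A[i]); USE(B[i]); MAY_DEF(B[i]);
    }
    if (n < 4) DEF(C[0]);   /* one-element def */
    USE(A[n-1]);
}
```

For n = 3 the enumeration yields four must-writes (A[0..2] and C[0]), four reads (B[0..2] and A[2]), and three may-writes (B[0..2]), which is the kind of information the three access relations would encode symbolically.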

3) Directives: PENCIL uses directives inspired by OpenMP, 
OpenACC and advanced vectorizing compilers. 

The restrictions presented in the previous section simplify 
data and control dependence analysis, which gives PENCIL 
compilers a boost in loop optimizations. When this falls short 
of providing the compiler with necessary static information, 
however, dependence information can be explicitly supplied 
as directives. One such directive is 

#pragma pencil independent [(l1, ..., ln)]

The list l1, ..., ln indicates the labeled statements on which
the loop independence is guaranteed. A statement that appears 
in an independent clause is assumed not to have any loop- 
carried dependence with any other statement in the loop. If this 
list is omitted then all statements in the loop body are free of 
dependences carried by the annotated loop. In the following 
example: 

#pragma pencil independent
for (int i=0; i<N; i++)
    A[t[i]]++;

different iterations of the loop may write to the same array 
location. The write location depends on the value of t[i].
In order to parallelize the loop, the compiler needs to make 
sure that there is no loop-carried dependence, but proving this 
property is not possible at compile time. Thus, the compiler 
considers conservatively that there may be a dependence 
between the different iterations and the loop is not parallelized. 
If the DSL compiler or the expert PENCIL programmer knows
that all values of t[i] are different, then an independent pragma
should be inserted to indicate that different iterations
of the loop are independent. This will not only enable the 
parallelization of the loop, but also provide valuable static 
information to other loop transformations and optimizations. 
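
The following sketch wraps the loop above in a function that follows the PENCIL array-parameter rules. The function name bump is hypothetical, and the precondition that t holds distinct indices (for instance, a permutation of 0..n-1) is exactly the knowledge the pragma is meant to convey. An ordinary C compiler ignores the unknown pragma (possibly with a warning), so the sequential meaning is unchanged.

```c
/* Hypothetical helper, not part of PENCIL. An ordinary C compiler
   ignores the unknown pragma, so the loop keeps its sequential meaning. */
void bump(int n, int A[restrict const static n],
          const int t[restrict const static n])
{
    /* Valid only if all t[i] are distinct (e.g. t is a permutation):
       then no two iterations touch the same element of A. */
    #pragma pencil independent
    for (int i = 0; i < n; i++)
        A[t[i]]++;
}
```

If t contained a repeated index, the annotation would be incorrect and a parallelizing compiler relying on it could produce wrong results; checking the property remains the responsibility of whoever inserts the pragma.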

Unlike the OpenMP parallel pragma, it is possible to 
use the independent pragma on while loops to indicate that 
there is no dependence between the different iterations of the 
while loop. It is up to the compiler to use this information to 
optimize the code. Moreover, the independent pragma allows 
fine grain code description as its scope may be limited to only 
one statement in the loop body. 

Pencil also defines a reduction directive equivalent to the 
reduction directive defined in OpenMP and OpenACC. It has 
the following syntax: 

#pragma pencil reduction (operator : scalars)
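
As a small illustration of this directive, the sketch below sums an array using the preliminary syntax given above. The function name sum_array is hypothetical; a plain C compiler ignores the pragma and executes the loop sequentially.

```c
/* Hypothetical example: sums the elements of v. The pragma tells a
   PENCIL compiler that 's' is only combined with '+', so partial sums
   may be computed in any order; a plain C compiler ignores it. */
double sum_array(int n, const double v[restrict const static n])
{
    double s = 0.0;
    #pragma pencil reduction (+:s)
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```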

Note that PENCIL does not compete with OpenMP; rather,
PENCIL complements OpenMP and, in general, the coding
rules defined by PENCIL are useful for compiler optimizations 
even if they are used outside PENCIL. 

All in all, this feature set provides a language whose 
overall semantics is sequential but which places conventions 
and restrictions that increase the static information available 
to the compiler, thus enabling the compiler to do more ag- 
gressive loop nest optimizations and parallelization. Although 
the language is sequential, the information about parallelism 
available at the DSL level is not lost, because this information 
is expressed in PENCIL through directives like independent,
which indicates the absence of dependences in a given loop.
Any lower level compiler can use this information not only to
parallelize the loop but to apply other optimizations as well. 
Expressing the absence of dependences is more powerful than 
expressing only parallelism. 

III. Examples of DSL translation into Pencil 

This section provides examples of DSLs that can be mapped 
into Pencil, and benefit from the optimizations provided by 
Pencil compilers, including polyhedral compilation methods. 
Some DSLs are mostly designed for programmer productivity, 
and their compilation flow typically combines specific passes 
for abstraction penalty removal with more generic optimization 
passes. Such DSLs should immediately benefit from Pencil 
with minor modifications to their compilation flow. Other 
DSLs involve a lot of domain-specific information available 
for compile-time optimizations. Since a large number of these 
optimizations are actually generic ones, expressible as loop 
transformations and storage mapping choices, PENCIL will 
also contribute to the simplification of their tool flow. 

In all of the following examples, memory access informa- 
tion should be used to annotate functions called from within 
the kernels. Moreover, the coding rules are mandatory to 
enable a precise dependence analysis. 

A. OP2 library

OP2 [?] is a state-of-the-art library for parallelizing un- 
structured mesh computations. It restricts the computational 
kernel's data-access pattern, simplifying dependence analyses 
and facilitating task decomposition, scheduling, and data layout.
While a great deal of OP2's innovation lies in its efficient
backend implementations, it is noteworthy how PENCIL captures
OP2's most important restrictions. Let us illustrate this
with the following program using OP2's C++ binding, adapted
from [?]. Functions named with the op_ prefix constitute 
OP2's API. For the sake of conciseness we have omitted 
string parameters that are used for dynamic type checking and 
diagnostics. 

void kernel(double *edge,
            double *cell0, double *cell1)
{
    *cell1 += *edge; *cell0 += *edge;
}

void main_loop(int ncells, int nedges,
               int *edge_to_cells,
               double *edge_data,
               double *cell_data)
{
    op_set cells = op_decl_set(ncells);
    op_set edges = op_decl_set(nedges);
    op_map pecell = op_decl_map(edges, cells, 2,
                                edge_to_cells);
    op_dat dcells = op_decl_dat(cells, 1, cell_data);
    op_dat dedges = op_decl_dat(edges, 1, edge_data);

    op_par_loop(kernel, edges,
                op_arg(dedges, -1, OP_ID, 1, OP_READ),
                op_arg(dcells, 0, pecell, 1, OP_INC),
                op_arg(dcells, 1, pecell, 1, OP_INC));
}

In this example, we assume a 2D mesh with ncells cells,
numbered (or indexed) from 0 through ncells-1, and a
total of nedges edges also numbered from 0. We ignore
boundary edges for simplicity and assume that each edge falls
between exactly two cells. The input edge_to_cells is
a 1-to-2 mapping that indicates which edge touches which
cells: the edge with index i touches the cells with indices
edge_to_cells[2*i] and edge_to_cells[2*i+1].
Every edge or cell carries one double-precision floating-point
value, specified by edge_data or cell_data, respectively.
The main computational kernel adds to each cell all data
coming in from its edges; we wish to do this for all cells.

The first six lines of main_loop() just communicate this
setup to OP2. op_decl_set() is used to declare the set of
cells and the set of edges, while op_decl_map() defines
the relationship between them using edge_to_cells. The
argument 2 indicates to OP2 that this is a 1-to-2 mapping.
Conceptually, pecell is just a copy of edge_to_cells,
made opaque so that OP2 is not constrained by the layout
or location of edge_to_cells. Finally, op_decl_dat()
attaches data to the cells and edges.

The most interesting part is op_par_loop (), which is 
conceptually equivalent to the following plain C loop: 

for (int i=0; i < nedges; ++i)
    kernel(&dedges[i],
           &dcells[pecell[2*i]],
           &dcells[pecell[2*i+1]]);

In words, op_par_loop() iterates over the indices of
edges, calling a kernel on the data associated with each
index. The three calls to op_arg() are used to indicate the
arguments of kernel(), and to describe how each argument
is being accessed.
For example

op_arg(dcells, 0, pecell, 1, OP_INC)

tells op_par_loop() that the second argument of kernel (cell0)
is taken from dcells; that the index used to access dcells is
calculated by looking up pecell at the loop index i and
adding the offset 0; and that the number of data elements passed to
the kernel is 1, starting at the translated index. OP_ID is
used to indicate that the loop index should be used directly
to address the data. The last argument, OP_INC, is a hint on
how the kernel function accesses this data; it means the data
is the subject of a global-reduction sum, as seen for cell0
and cell1 in the above example. The other possible hints
are OP_READ, OP_WRITE, and OP_RW. For the last two,
the kernel code must ensure that no data conflict is possible
between different iterations.

It should be clear that OP2's semantics is correctly cap- 
tured in Pencil by translation to a for loop like the one 
above, with the caveat that kernel must be either inlined or 
modified to 

void kernel(double edge[], int ie,
            double cell0[], int i0,
            double cell1[], int i1)
{ cell1[i1] += edge[ie];
  cell0[i0] += edge[ie]; }

(because PENCIL does not allow pointers). The translated for 
loop is legal PENCIL. The other parts - the first six lines of 
main_loop() - simply reify and constrain the input data,
which is unnecessary in PENCIL. 
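
Putting these pieces together, the translated loop might look like the sketch below, here using the inlining option so that no pointers to individual elements are needed. The parameter names follow the OP2 example; the array qualifiers and size expressions are assumptions based on the coding rules of Section II-B, and no reduction directive is shown because the preliminary syntax only covers scalars.

```c
/* Sketch of the op_par_loop from the text with the kernel inlined.
   Parameter names follow the OP2 example; the array qualifiers are an
   assumption based on the PENCIL coding rules. */
void main_loop(int ncells, int nedges,
               const int edge_to_cells[restrict const static 2 * nedges],
               const double edge_data[restrict const static nedges],
               double cell_data[restrict const static ncells])
{
    for (int i = 0; i < nedges; ++i) {
        /* Each edge adds its value to the two cells it touches. */
        cell_data[edge_to_cells[2 * i]]     += edge_data[i];
        cell_data[edge_to_cells[2 * i + 1]] += edge_data[i];
    }
}
```

For a mesh of three cells in a row with two interior edges, edge 0 between cells 0 and 1 and edge 1 between cells 1 and 2, the middle cell accumulates the values of both edges while the outer cells each receive one.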
This is not surprising, as OP2 is a more aggressively
restricted DSL than PENCIL. The more interesting fact is 
how much of OP2's static information can be captured in 
Pencil. The single greatest benefit from OP2's programming 
model is probably elimination of pointer analysis. This is 
built into PENCIL. Of OP2's access hints, OP_INC can be 
expressed with a reduction pragma, the conflict-freedom 
requirement of OP_WRITE/OP_RW can be expressed with 
#pragma independent, and OP_READ should be infer- 
able from the source code. 

One aspect of OP2 that is not currently captured explicitly
by PENCIL is that it allows the non-associativity of floating-point
arithmetic to compromise bit-wise reproducibility of results.
The example program above suffers from the problem that 
parallelizing the loop in any way compromises numerical 
precision to some extent. This is a long-standing and well- 
known issue, often handled by providing a switch or pragma 
to allow trading precision for efficiency. We plan to follow 
this well-accepted practice. 
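
The effect is easy to reproduce in isolation. The sketch below, whose function name is hypothetical, compares the two groupings of a three-term sum in double precision; the volatile intermediates are only there to force each partial sum to be rounded to double.

```c
/* Returns 1 if (a + b) + c and a + (b + c) differ in double precision.
   The volatile intermediates force each partial sum to be stored as a
   double, so extended-precision registers cannot hide the rounding. */
int reassociation_changes_result(double a, double b, double c)
{
    volatile double ab = a + b;
    volatile double bc = b + c;
    double left  = ab + c;   /* (a + b) + c */
    double right = a + bc;   /* a + (b + c) */
    return left != right;
}
```

With a = 1e17, b = -1e17 and c = 1.0, the first grouping gives 1.0 while the second gives 0.0, because 1.0 is lost when it is added to -1e17. A parallel reduction that regroups the additions of the OP2 example can change results in the same way.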

B. Delite/OptiML 

OptiML [?] is a DSL for machine learning built on top of 
Delite [?], a framework for creating implicitly parallel DSLs.

An OptiML program is actually a program generator em- 
bedded in Scala. It uses meta-programming to construct a 
symbolic representation of the DSL program as it is executed. 
Each program expression, such as if (c) a else b, constructs 
an IR node when the program is run. Instead of using a control 
flow graph (CFG) for the different statements with fixed basic 
blocks, Delite uses a "sea of nodes" [?] as an IR representation. 
Nodes are connected with respect to their (input and control) 
dependences but are allowed to float freely otherwise. 

The Delite IR provides several operators. A given DSL may 
use a subset of these operators and may also extend existing 
operators to create new ones. 

OptiML programs operate on the high-level mutable
types Vector[T] and Matrix[T], and OptiML provides four main control
structures: sum, vector construction, untilconverged and
gradient descent. We enumerate these structures and show how they
can be mapped to PENCIL.

• sum: expresses generic summations over an indexed
range. It calculates $\sum_i f(i)$, where $f(i)$ is a user-defined
function. For example

val x = sum(0,100) { i => exp(i) }

calculates

x = exp(0) + exp(1) + exp(2) + ...

sum is implemented as a parallel tree-reduce and can be
translated into PENCIL using a for loop and a reduction
directive.

x = exp(0);
#pragma pencil reduction (+:x)
for (i=1; i<=100; i++)
    x += exp(i);

• vector construction: implemented as a parallel map in 
Delite. It has the following form 

val my_vector = (0::end) { i => }

and can be translated into PENCIL using a simple for
loop. There is no need in this case to use the independent
pragma, as the loop nest is always affine and the underlying
optimization tools that operate on PENCIL will be
able to recover the parallelism and generate parallel
code.

for (i=0; i<=end; i++)
    my_vector[i] = 0;

• untilconverged: an iterative control structure that iterates
until reaching a convergence criterion. Each iteration
produces a value, and the loop converges when the 
difference between values in consecutive iterations falls 
below a supplied threshold. This control structure is 
implemented in PENCIL as a sequential loop. 

• gradient descent: a specialized version of untilconverged
that implements the gradient descent algorithm
for exponential family models. It provides batch and 
stochastic variants. The batch variant uses a parallel 
algorithm and thus it can be mapped, in PENCIL, into 
a for loop annotated with the independent pragma to 
indicate that there is no loop carried dependence. The 
stochastic variant is mapped into a sequential for loop 
as the algorithm is not parallel. 
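
To illustrate the sequential mapping of untilconverged described in the list above, the sketch below iterates until two consecutive values differ by less than a threshold. The choice of Newton's iteration for a square root, the function name, and the max_iter safeguard are assumptions made for this example and do not come from OptiML.

```c
/* Hypothetical illustration of an untilconverged-style loop: Newton's
   iteration for sqrt(a), a > 0. Iteration k+1 depends on the value
   produced by iteration k, so the loop is inherently sequential. */
double sqrt_until_converged(double a, double threshold, int max_iter)
{
    double x = a > 1.0 ? a : 1.0;   /* initial guess */
    double diff = threshold;        /* forces at least one iteration */
    int k = 0;
    while (diff >= threshold && k < max_iter) {
        double next = 0.5 * (x + a / x);
        /* difference between values of consecutive iterations */
        diff = next > x ? next - x : x - next;
        x = next;
        k++;
    }
    return x;
}
```

The loop uses only structured control flow, and the loop-carried dependence on x is explicit, so no independent pragma applies here.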

IV. Conclusion 

We proposed Pencil, a platform-neutral compute interme- 
diate language for productive and performance-portable ac- 
celerator programming. This intermediate language facilitates 
the design and implementation of high-level programming 
environments for parallel architectures. In particular, we be- 
lieve Pencil reduces the complexity and costs of exploiting 
heterogeneous systems. 